Industrialization and manufacturing growth in China have been accompanied by major growth in pollution. Key pollutants include fine particulate matter (PM2.5 and PM10), nitrogen dioxide (NO2), sulfur dioxide (SO2), surface ozone (O3), and carbon monoxide (CO). The US EPA’s Air Quality Index (AQI) is one measure of air quality based on these six pollutants. To attain a “good” AQI based on PM2.5, the 24-hour average must be below 12 μg/m^3; a 24-hour average PM2.5 level of at least 55.5 μg/m^3 is considered “unhealthy,” while levels above 250.5 μg/m^3 are considered “hazardous.” The data here come from Beijing, a city notable for its poor air quality.
The file beijing.csv contains 420,768 hourly measurements of PM2.5 and weather-related factors recorded at 12 monitoring stations around Beijing from March 1, 2013 through February 28, 2017. The file has not been cleaned. The data are as follows, with all pollutants measured in micrograms per cubic meter (μg/m^3).
a. We have data at an hourly scale, but there are a couple of derived attributes related to date that we may find useful:
* Add a column for “day of year” (for example, January 3rd is the third day of the year) using the yday function in the lubridate package.
* Add a column for “weekday” using your method of choice. Hint: format( , "%A") may be useful.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
beijing <- read.csv("beijing.csv")
as_dates <- as.Date(ISOdate(beijing$year, beijing$month, beijing$day))
beijing$weekday <- format(as_dates, "%A")
beijing$day_of_year <- yday(as_dates)
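As a quick sanity check on these derived columns, a minimal sketch using only base R (the `yday` field of `as.POSIXlt` is the zero-based counterpart of lubridate::yday):

```r
# January 3rd should be the 3rd day of the year
d <- as.Date("2013-01-03")
as.POSIXlt(d)$yday + 1   # base-R equivalent of lubridate::yday(d) -> 3
format(d, "%A")          # weekday name (locale-dependent)
```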
b. Each measurement is associated with a weather station. The location of each station is in stations.csv. Inner join the information from that file into our air quality data.
stations <- read.csv("stations.csv")
beijing <- inner_join(beijing, stations, by=c("station"))
c. This data is currently in a wide format with respect to the pollutants (one column for each pollutant), but we would like to compare trends across pollutants. Ideally we would have a figure that summarizes ALL our data while splitting it up by pollutant. This calls for a long format. Use the pivot_longer function from the tidyverse to make a new data frame with a row for each pollutant measurement, a column titled pollutant for the name of the pollutant, and a column titled concentration for its concentration.
beijing_long <- pivot_longer(beijing, c("PM2.5", "PM10", "SO2", "NO2", "CO", "O3"), names_to="pollutant", values_to="concentration")
a. Build a single figure that describes the distribution of the pollutants depending on the day of the week. To make your figure more informative, consider the order that the weekdays are displayed. For this first figure, do not use facet_wrap or facet_grid and instead set the x aesthetic to pollutant.
beijing_long$weekday <- factor(beijing_long$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
ggplot(beijing_long, aes(pollutant, concentration, fill=weekday)) +
geom_boxplot() +
ylab("Concentration")
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
b. Wow, that figure looks pretty rough, since the pollutant concentrations vary not only across pollutants but also within each pollutant due to the huge number of outliers! Make the same figure but use a log scale. What base of the logarithm should you use?
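One consideration for the base: switching bases only rescales the log values by a constant factor, so the shape of the figure is unchanged; base 10 is a common choice because axis breaks then land on powers of ten. A quick check of the change-of-base identity:

```r
x <- c(10, 100, 1000)
log2(x)              # base-2 logs
log10(x) / log10(2)  # identical values via change of base
```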
ggplot(beijing_long, aes(pollutant, concentration, fill=weekday)) +
geom_boxplot() +
scale_y_continuous(trans='log10') +
ylab("Concentration")
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
c. Now make the same graph but use facet_wrap or facet_grid. Is this figure or the one above more informative (in your opinion)?
ggplot(beijing_long, aes(y = concentration, fill=weekday)) +
geom_boxplot() +
scale_y_continuous(trans='log10') +
ylab("Concentration") +
facet_wrap(~pollutant)
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
d. The distributions are much easier to visualize now, without compromising the conclusions we can draw. Does the day of the week seem to make a difference?
e. Make a similar figure with a log scale, but showing the change in pollutants by month. It is your choice whether to facet or not. Describe some patterns you see in the data.
ggplot(beijing_long, aes(pollutant, concentration, fill=factor(month))) +
geom_boxplot() +
scale_y_continuous(trans='log10') +
ylab("Concentration") +
scale_fill_discrete(labels=month.abb)
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
f. Finally make a similar figure, but on an hourly scale. How could this plot be improved?
ggplot(beijing_long, aes(pollutant, concentration, fill=factor(hour))) +
geom_boxplot() +
scale_y_continuous(trans='log10') +
ylab("Concentration")
## Warning: Removed 70303 rows containing non-finite values (stat_boxplot).
a. Beijing residents believe that the worst air quality occurs during easterly winds. Make a data frame of summary statistics describing the relative magnitude of each pollutant by wind direction. Your data frame should allow someone to quickly answer the following questions:
Hint: You may find the following code chunk helpful.
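The chunks below reference a `wind_direction_map` that is never defined in this document. A minimal sketch of what it is assumed to be: a named vector mapping the 16 compass labels found in `wd` to bearings in degrees (N = 0, E = 90, proceeding clockwise in 22.5° steps).

```r
# Assumed helper (not defined in the original chunks): compass label -> degrees
dirs <- c("N","NNE","NE","ENE","E","ESE","SE","SSE",
          "S","SSW","SW","WSW","W","WNW","NW","NNW")
wind_direction_map <- setNames(seq(0, 337.5, by = 22.5), dirs)
```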
wind_direction_info <- beijing_long %>%
  drop_na(wd, pollutant) %>%
  group_by(wd, pollutant) %>%
  summarise(
    mean_speed = mean(WSPM, na.rm = TRUE),   # na.rm so missing speeds don't propagate NA
    mean_value = mean(concentration, na.rm = TRUE),
    angle = wind_direction_map[first(wd)]
  )
## `summarise()` has grouped output by 'wd'. You can override using the `.groups` argument.
max_mean_by_pollutant <- wind_direction_info %>%
group_by(pollutant) %>%
summarise(max_mean = max(mean_value))
wind_direction_info <- inner_join(wind_direction_info, max_mean_by_pollutant, by=c("pollutant"))
b. Build a single figure using polar coordinates which describes the average relative magnitude for each pollutant when grouped by wind direction. Account for wind speed in some way in your figure.
ggplot(wind_direction_info, aes(x=angle, y=mean_value/max_mean, fill=mean_speed)) +
geom_bar(width=22.5, stat="identity", color="white") +
scale_fill_gradient(low="blue", high="red") +
coord_polar(start = -pi/16) +
theme(
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
axis.line = element_blank(),
legend.position="bottom",
legend.key.width = unit(1.5, "cm")
) +
geom_text(aes(angle, 1.1, label = wd), color="black", size=2) +
ggtitle("Average of Pollutant vs. Direction") +
guides(fill=guide_colorbar(title="Wind Speed (m/s)")) +
facet_wrap(~pollutant)
c. Does your visualization support the belief of Beijing residents?
Beijing experienced one of the worst air quality crises in Chinese history in January 2013, with the entire city covered in a dense gray haze visible from space. This experience led the Chinese government to immediately draft an air pollution control plan that was implemented effective September 2013. We will build a visualization similar to the one below:
a. To make our graph a little easier to build, let’s aggregate our data by day. Make another data frame that summarizes each pollutant/day pair by the mean of each pollutant across all of Beijing. Include the day of year attribute from earlier in your summary dataframe.
Hint: Your new dataframe should have at least the following column names:
“year”, “month”, “day”, “pollutant”, “average_concentration”, “day_of_year”
beijing_by_day <- beijing_long %>%
group_by(year, month, day, pollutant) %>%
summarise(average_concentration = mean(na.omit(concentration)),
day_of_year = first(day_of_year))
## `summarise()` has grouped output by 'year', 'month', 'day'. You can override using the `.groups` argument.
b. Create the plot mentioned above, but use some type of smoothing. Because the data is so erratic, it helps to visualize the trends with a rolling average. Use rollmean, rollmax, rollmedian, or rollsum from the zoo package, and pick an appropriate smoothing window. You may need to use na.pad=TRUE for the zoo functions.
Hint: Use the “day of year” column defined earlier.
Bonus: Format the x-axis ticks to display an appropriate date format.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
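Before plotting, it may help to see how zoo's rolling functions behave on a toy vector; note that rollmedian requires an odd window, and na.pad=TRUE keeps the output the same length as the input by padding the ends with NA. A small sketch:

```r
library(zoo)

x <- c(1, 5, 2, 8, 3)
rollmedian(x, 3)                 # medians of each 3-value window: 2 5 3
rollmedian(x, 3, na.pad = TRUE)  # same, padded to length 5: NA 2 5 3 NA
```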
beijing_by_day %>%
  # smooth within each year/pollutant series so the window never crosses
  # from one series into the next; rollmedian requires an odd window (15)
  group_by(year, pollutant) %>%
  mutate(smoothed = rollmedian(average_concentration, 15, na.pad=TRUE)) %>%
  ggplot(aes(x=day_of_year, y=smoothed,
             color=factor(year),
             alpha=factor(year))) +
  geom_line() +
  facet_wrap(~pollutant) +
  xlab("Day of Year") +
  ylab("15-day rolling median of pollutant daily mean (log scale)") +
  scale_y_continuous(trans="log10") +
  scale_alpha_manual(values=c(1, 0.15, 0.15, 0.15, 0.15))
c. What visual evidence (if any) is there that the plan in 2013 was effective? Justify your choice of smoothing function and window.
We will now animate the data using ggmap. Pick a map from the ./maps folder. You can see what a map looks like with code similar to the chunk below:
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
load(file = "./maps/toner.RData") # sets map variable
ggmap(map)
Justify your map choice.
library(ggvoronoi)
single_measurement <- beijing_long %>%
filter(year==2015, month==3, day==2, hour==3, pollutant=="PM2.5")
map_for_plotting <- ggmap(map, extent="device", legend="bottomright")
map_for_plotting + geom_segment(
data=single_measurement,
aes(x=long, y=lat,
xend=long + cos(wind_direction_map[wd] * pi/180) * WSPM/100,
yend=lat + sin(wind_direction_map[wd] * pi/180) * WSPM/100, color=concentration),
arrow=arrow(length=unit(0.1, "cm"))) +
scale_color_continuous(low="blue", high="red")
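A note on the trigonometry above: using cos() for x and sin() for y treats the angle in the math convention (0° = east, counterclockwise). If `wind_direction_map` instead stores compass bearings (0° = north, clockwise), the components swap; and since `wd` records the direction the wind blows from, the downwind displacement is the negation. A hypothetical helper (`bearing_to_xy` is not part of the original code) sketching that conversion:

```r
# Convert a compass bearing in degrees (0 = N, 90 = E, clockwise)
# to a unit displacement pointing TOWARD that bearing.
bearing_to_xy <- function(deg) {
  rad <- deg * pi / 180
  c(x = sin(rad), y = cos(rad))
}

bearing_to_xy(90)   # due east: x = 1, y ~ 0
```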
map_for_plotting +
geom_segment(
data=single_measurement,
aes(x=long, y=lat,
xend=long + cos(wind_direction_map[wd] * pi/180) * WSPM/100,
yend=lat + sin(wind_direction_map[wd] * pi/180) * WSPM/100, color=concentration),
arrow=arrow(length=unit(0.1, "cm"))) +
scale_color_continuous(low="blue", high="red") +
geom_voronoi(
data=single_measurement,
aes(x=long, y=lat, fill=TEMP),
alpha=0.1) +
scale_fill_continuous(low="blue", high="red")